Abstract

The paper discusses three major issues.
First, it discusses why it makes sense to approach
problems in a hierarchical fashion. Second, it
defines the class of hierarchically decomposable
functions that can be used to test the algorithms
that approach problems in this fashion. Finally, the
Bayesian optimization algorithm (BOA) is extended
in order to solve the proposed class of problems.



1 INTRODUCTION 

Recently, the connection between human innovation 
and genetic algorithms has been discussed (Goldberg, 
2000; Holland, 1995; Koza, 1994; Koza, Bennett III, 
Andre, & Keane, 1999). There are two important im- 
plications of this result: the innovation can be thought 
of as a model of genetic algorithms and the genetic al- 
gorithms can be thought of as a model of innovation. 
Moreover, in the genetic and evolutionary computa- 
tion community there has been a growing interest in 
what we call hierarchical problem solving. 

The purpose of this paper is threefold. The paper dis- 
cusses why it makes sense to approach problems in a 
hierarchical fashion and combine promising solutions 
from lower levels to form the solutions on a higher 
level. The class of hierarchically decomposable prob- 
lems, as an extension of widely discussed, used, and an- 
alyzed additively decomposable problems, is defined. 
Finally, the Bayesian optimization algorithm (BOA) 
is extended in order to solve the described class of 
problems. 

Section 2 provides the background and motivation. 
The class of hierarchically decomposable problems is 
defined in Section 3. The Bayesian optimization
algorithm is briefly described in Section 4. Section 5
discusses possible extensions of the models used in the
BOA to guide the search in order to adjust the model
building to hierarchical problems. The directions of
future research are outlined in Section 6. The paper is
summarized and concluded in Section 7.



2 GENETIC ALGORITHMS, 
INNOVATION, AND 
HIERARCHY 

A genetic algorithm (Holland, 1975; Goldberg, 1989) 
evolves a population of potential solutions to a given 
problem. The first population of solutions is generated 
at random. By means of a measure of quality of solu- 
tions given by a user, usually expressed in the form of 
one or multiple functions, better solutions are selected 
from the current population. The selected solutions 
undergo the operators of mutation and crossover in 
order to create the population of new solutions (the 
offspring population) that fully or in part replace the 
original (parent) population. The process repeats until 
the termination criteria (e.g., convergence to a single- 
ton) given by the user are met. 

As was argued recently (Goldberg, 2000), selection,
crossover, and mutation are not very interesting
operators when acting alone. By repeatedly applying
selection alone, the best solution of the initial
population would simply take over the entire population.
More importantly, nothing but the small, randomly
generated region of the search space (the initial
population of solutions) would be explored. By
repeatedly applying crossover alone, the final effect
would be that of shuffling parts of a set of randomly
generated solutions. By repeatedly applying mutation
alone, only the neighborhood of randomly generated
solutions would be explored. Thus, the effect of
applying any of these operators by itself would be no
better than that of generating a number of solutions at
random with no bias whatsoever.

Important features of these operators emerge when us- 
ing a combination of these. By using the selection and 
mutation together, the initial population of solutions is 
continually improved by selecting better solutions and 
exploring their close neighborhood. By introducing 
crossover, the solutions are no longer only improved 
by slight perturbations but pieces of solutions are com- 
bined together to form new solutions. This process is 
not unlike a human cross-fertilizing innovation (Gold- 
berg, 2000). 

One of the implications of this argument is that the re- 
sults of genetic and evolutionary computation can be 
used as a tool for understanding and modeling human 
innovation. On the other hand, the achievements and 
experience from human innovation and engineering de- 
sign can be seen as yet another source of inspiration for 
genetic and evolutionary computation in order to de- 
sign methods that solve hard problems of our interest 
quickly, accurately, and reliably. This paper
investigates the use of hierarchical problem solving,
one of the cornerstones of engineering design, in order
to improve current genetic and evolutionary
optimization methods.

In engineering design, the problems are often solved 
in a hierarchical fashion. New designs or ideas are 
composed of other designs or ideas without having to 
reinvent these. Many sub-parts of our new design can 
be created separately and the final result is produced 
by combining the alternatives. For example, when de- 
signing a car, the car stereo and the engine can be 
designed separately and combined together to form a 
part of a new car design. Various alternatives can be 
tried and the final choice can be done by comparing 
different combinations of reasonable car stereos and 
engines. When designing an engine, there is no need 
to reinvent the carburetor, and one can simply choose 
one from a set of reasonable carburetors that we have 
already designed. When completing the design, we can 
simply use an appropriate engine in combination with 
the remaining parts (e.g., the car stereo). To put all 
the parts together, we need not reinvent nuts and bolts 
each time we modify some part of the engine (e.g., the 
size of cylinders) but simply use some reasonable ones 
we have designed along the way. In general, higher- 
level knowledge can be obtained at much lower price 
when we approach the problem at lower level first, and 
use the results of this in order to compose higher-order 
solutions. 

The next section describes a general class of
hierarchically decomposable functions. With this
definition at hand, the subsequent sections continue by
proposing an extension of the recently proposed Bayesian
optimization algorithm to the class of hierarchically
decomposable problems, which are an extension of the
additively decomposable problems discussed in our
earlier work.

In the following text, the solutions are represented by 
binary strings of a fixed length, but the results can 
be easily extended to fixed-length strings over any fi- 
nite base alphabet. Each string position represents a 
(binary) random variable and the set of promising so- 
lutions selected according to their fitness represents a 
multivariate random sample. 



3 HIERARCHICALLY
DECOMPOSABLE FUNCTIONS

Hierarchically decomposable functions (HDFs) were 
first presented by Goldberg (1998) who designed the 
so-called Tobacco Road Function which combined de- 
ception and multimodality up a number of levels (also 
in Goldberg (1997)). The class of hierarchically consis- 
tent functions was later presented by Watson, Hornby, 
and Pollack (1998). In HDFs, the fitness contribution 
of each building block (an intact sub-part of the so- 
lution quasi-separable from its context) is separated 
from its interpretation (meaning) when it is used as 
a building block for constructing the solutions on a 
higher level. The overall fitness is defined as the sum 
of fitness contributions of each building block. 

In this paper, we will consider a more general class 
of hierarchically decomposable functions than the one 
introduced by Watson et al. (1998) that allows any 
order and interpretation of every building block on 
each level. Furthermore, the fitness contribution of 
each considered vector of interpretations will not be 
assigned uniformly but an arbitrary function for each 
block of interpretations can be used. Finally, the in- 
dividual contributions will not be automatically mul- 
tiplied by the length of the input vector. 

Let us consider a hierarchical function on input vectors
$X = (X_0, \ldots, X_{n-1})$ of $n$ variables defined on
$L < n$ levels. The value of variable $X_i$ is denoted by
$x_i$. On each level $i \in \{1, \ldots, L-1\}$, let us
define $m_i$ functions that contribute to the overall
fitness. On input, each of these functions gets a vector
of building-block interpretations (meanings) from a
lower level. On each level $i$ there will be $m_i$
recursively computed interpretations.

The interpretations on the 0th level are simply the
values of the input variables, i.e. $v_{0,k} = x_k$ for
all $k$. There are $m_0 = n$ such interpretations. On
level $i > 0$, the $j$th contributory function $f_{i,j}$
is defined on a subset of interpretations from the lower
level, with the indices from
$S_{i,j} \subset \{0, \ldots, m_{i-1} - 1\}$. These
interpretations will be joined together to form a
higher-level interpretation by function $T_{i,j}$. We
denote the vector of interpretations with the indices
from $S_{i,j}$ by $V_{i,j}$, i.e.
$V_{i,j} = (v_{i-1,k} \mid k \in S_{i,j})$. The $j$th
interpretation on the level $i$, denoted by $v_{i,j}$,
is then given by the recursive function

$$v_{i,j} = \begin{cases} x_j & \text{if } i = 0 \\ T_{i,j}(V_{i,j}) & \text{otherwise} \end{cases} \tag{1}$$

where $i \in \{0, \ldots, L-1\}$, and $j \in \{0, \ldots, m_i - 1\}$.
A simple example of the recursive computation of the 
interpretations for a vector of n = 9 variables and 
L = 3 levels is shown in Figure 1. 

The total fitness is defined as the sum of functions
defined on the subsets of interpretations that are
interpreted together in order to get the interpretation
on a higher level. For the $j$th interpretation function
$T_{i,j}$, the corresponding contributory function with
the same inputs is denoted by $f_{i,j}$. The overall
value of the fitness function is thus given by

$$f(X) = \sum_{i=1}^{L-1} \sum_{j=0}^{m_i - 1} f_{i,j}(V_{i,j}). \tag{2}$$

Figure 1: An example interpretation for n = 9 variables
on L = 3 levels.

To solve HDFs, two issues must be addressed. The
models must allow groups of variables to be merged
into a single unit that will be further treated as a
new ultimate variable. Moreover, niching becomes an
important issue: to combine solutions from a certain
level to form a solution of a higher order, we need to
preserve diversity so that a sufficient number of the
partial solutions we want to combine is available.
Niching methods have been frequently discussed in recent
work (Goldberg, 1989; Oei, Goldberg, & Chang, 1991;
Mahfoud, 1995; Mengshoel & Goldberg, 1999). Here
we focus on the modeling part, i.e. on what the models
to be used should look like and how they can be
learned given the set of promising solutions.

Example: 

The following function is defined by using bipolar
fully deceptive functions of order 6. A bipolar
function of order 6 is constructed from a deceptive
function of order 3, $f_{3deceptive}$, which is defined
on binary vectors $X = (X_0, X_1, X_2)$ of order 3 as a
function of $u$, the number of ones in the input vector
(string) $X$. The bipolar function of order 6 is defined
on binary vectors of length 6 as

$$f_{6bipolar}(u) = f_{3deceptive}(|3 - u|),$$

where $u$ is the number of ones in the 6-bit input vector.

The function is defined on L levels. The input vector 
is of size n = 6 . All interpretation functions will be 
defined in the same way and they will interpret a block
of 6 bits according to the major occurrence of either
bit (in case of a tie, we interpret the block as 0), i.e.

$$T_{i,j}(V_{i,j}) = \begin{cases} 0 & \text{if } u \le 3 \\ 1 & \text{otherwise} \end{cases}$$

where $u$ is the number of ones in the input vector of
interpretations (each of which is a binary number),
$i \in \{0, \ldots, L-1\}$, and $j \in \{0, \ldots, m_i - 1\}$.
The contributory functions $f_{i,j}$ simply return the
value of the bipolar function $f_{6bipolar}$, i.e.

$$f_{i,j}(V_{i,j}) = f_{6bipolar}(u),$$

where $V_{i,j}$ is the input vector of interpretations
(each of which is again a binary value), and $u$ is the
number of ones in the input vector $V_{i,j}$. The
function has two global optima, in the points 000...0
and 111...1. Like functions additively composed of
bipolar functions, it has a great number of deceptive
local optima. Moreover, the optima on the higher levels
aggravate the deception of the functions on each level.
To scale each function according to its "importance",
measured for instance by the number of input bits that
affect its value, the function contribution can be
multiplied by a factor of $6^j$, where $j$ is the number
of the level.
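The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `f3_deceptive`, `f6_bipolar`, `interpret`, and `hdf_bipolar` are hypothetical, the values 0.9, 0.8, 0, and 1 used for the order-3 deceptive function are an assumption (they are the values commonly used for this function in related work, not taken from the text above), and the input length is assumed to be a power of 6 so that the hierarchy closes with a single top-level block.

```python
from typing import List, Sequence

BLOCK = 6  # every interpretation and contributory function reads 6 inputs


def f3_deceptive(u: int) -> float:
    """Order-3 deceptive function of the number of ones u (ASSUMED values)."""
    return {0: 0.9, 1: 0.8, 2: 0.0, 3: 1.0}[u]


def f6_bipolar(u: int) -> float:
    """Bipolar function of order 6: f3_deceptive(|3 - u|)."""
    return f3_deceptive(abs(3 - u))


def interpret(block: Sequence[int]) -> int:
    """Majority interpretation of a 6-bit block; a tie is interpreted as 0."""
    return 1 if sum(block) > 3 else 0


def hdf_bipolar(bits: Sequence[int], scale: bool = False) -> float:
    """Sum the bipolar contributions over all levels of the hierarchy.

    With scale=True, the contributions on level j are multiplied by 6**j.
    """
    level: List[int] = list(bits)
    total = 0.0
    level_no = 1
    while len(level) > 1:
        if len(level) % BLOCK != 0:
            raise ValueError("input length must be a power of 6")
        blocks = [level[k:k + BLOCK] for k in range(0, len(level), BLOCK)]
        weight = BLOCK ** level_no if scale else 1
        total += weight * sum(f6_bipolar(sum(b)) for b in blocks)
        # The interpretations of this level are the inputs of the next one.
        level = [interpret(b) for b in blocks]
        level_no += 1
    return total
```

Under these assumptions, `hdf_bipolar([0] * 36)` and `hdf_bipolar([1] * 36)` both evaluate to 7.0 (six level-1 blocks plus one level-2 block, each contributing 1), whereas a string built from balanced blocks such as `[0, 0, 0, 1, 1, 1] * 6` scores lower.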

4 PROBABILISTIC
MODEL-BUILDING GENETIC
ALGORITHMS

Probabilistic model-building genetic algorithms
(PMBGAs), also called the estimation of distribution
algorithms (Mühlenbein & Paaß, 1996), replace genetic
recombination of the genetic algorithms (GAs) (Holland,
1975; Goldberg, 1989) by building an explicit model of
promising solutions and using the constructed model
to guide the further search. As models, probability
distributions are used. For an overview of recent work
on PMBGAs, see Pelikan, Goldberg, and Lobo (2000).

The Bayesian optimization algorithm (BOA) (Pelikan, 
Goldberg, & Cantú-Paz, 1998) uses Bayesian networks
to model promising solutions and subsequently guide 
the further search. In the BOA, the first population of 
strings is generated at random. From the current pop- 
ulation, the better strings are selected. Any selection 
method can be used. A Bayesian network that fits the 
selected set of strings is constructed. Any metric as 
a measure of quality of networks and any search algo- 
rithm can be used to search over the networks in order 
to maximize/minimize the value of the used metric. 
Besides the set of good solutions, prior information 
about the problem can be used in order to enhance 
the estimation and subsequently improve convergence. 
New strings are generated according to the joint distri- 
bution encoded by the constructed network. The new 
strings are added into the old population, replacing 
some of the old ones. 
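The loop described above (random initialization, selection, model building, sampling, replacement) can be sketched as follows. This is a minimal illustration, not the BOA itself: to keep it short, the Bayesian network is replaced by a much simpler model with no edges (independent bit frequencies), and truncation selection with replacement of the worse half is an arbitrary choice. All function names and parameters are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Bits = List[int]


def build_univariate_model(selected: List[Bits]) -> List[float]:
    """Stand-in for model building: frequency of a 1 at each position.

    The BOA would learn a Bayesian network from `selected` here instead.
    """
    n = len(selected[0])
    return [sum(s[i] for s in selected) / len(selected) for i in range(n)]


def sample_model(model: List[float], rng: random.Random) -> Bits:
    """Generate one new string according to the model."""
    return [1 if rng.random() < p else 0 for p in model]


def pmbga(fitness: Callable[[Bits], float], n: int, pop_size: int = 40,
          generations: int = 20, seed: int = 0) -> Tuple[Bits, List[float]]:
    rng = random.Random(seed)
    # The first population is generated at random.
    population = [[rng.randint(0, 1) for _ in range(n)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        selected = population[: pop_size // 2]      # truncation selection
        model = build_univariate_model(selected)    # model of good strings
        offspring = [sample_model(model, rng)
                     for _ in range(pop_size - len(selected))]
        population = selected + offspring           # replace the worse half
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history
```

For instance, `pmbga(sum, 12)` runs the loop on the onemax function over 12 bits. Because the selected strings are kept, the best fitness recorded in `history` never decreases from one generation to the next.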

As a model of the selected strings, a Bayesian net- 
work is used in the BOA. A Bayesian network is a 
directed acyclic graph with the nodes corresponding 
to the variables in the modeled data set (in our case, 
to the positions in the solution strings). Mathemati- 
cally, a Bayesian network encodes a joint probability 
distribution given by 



$$p(X) = \prod_{i=0}^{n-1} p(X_i \mid \Pi_{X_i}), \tag{3}$$

where $X = (X_0, \ldots, X_{n-1})$ is a vector of all the
variables in the problem, $\Pi_{X_i}$ is the set of
parents of $X_i$ in the network (the set of nodes from
which there exists an edge to $X_i$), and
$p(X_i \mid \Pi_{X_i})$ is the conditional probability of
$X_i$ conditioned on the variables $\Pi_{X_i}$. A
directed edge relates the variables so that in the
encoded distribution, the variable corresponding to the
terminal node will be conditioned on the variable
corresponding to the initial node. More incoming edges
into a node result in a conditional probability of the
corresponding variable with a conjunctional condition
containing all its parents.
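To make the factorization in Equation (3) concrete, the following sketch evaluates the joint probability of a binary string and generates new strings from a small network. The three-variable network, its probabilities, and the names `joint_probability` and `sample` are made up for illustration only.

```python
import random
from typing import Dict, List, Sequence, Tuple

# parents[i] lists the parents of X_i; cpt[i] maps a tuple of parent values
# to p(X_i = 1 | parents). The numbers are illustrative only.
parents: Dict[int, List[int]] = {0: [], 1: [0], 2: [0, 1]}
cpt: Dict[int, Dict[Tuple[int, ...], float]] = {
    0: {(): 0.5},
    1: {(0,): 0.1, (1,): 0.9},
    2: {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.8},
}


def joint_probability(x: Sequence[int], parents, cpt) -> float:
    """Product over all variables of p(x_i | values of the parents of X_i)."""
    p = 1.0
    for i, xi in enumerate(x):
        p_one = cpt[i][tuple(x[j] for j in parents[i])]
        p *= p_one if xi == 1 else 1.0 - p_one
    return p


def sample(parents, cpt, order: Sequence[int], rng: random.Random) -> List[int]:
    """Generate one string; `order` must list parents before their children."""
    x: Dict[int, int] = {}
    for i in order:
        p_one = cpt[i][tuple(x[j] for j in parents[i])]
        x[i] = 1 if rng.random() < p_one else 0
    return [x[i] for i in sorted(x)]
```

Here `joint_probability([1, 1, 1], parents, cpt)` is the product 0.5 · 0.9 · 0.8, and `sample(parents, cpt, [0, 1, 2], random.Random(0))` draws each variable after its parents, which is how new strings can be generated from the encoded distribution.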

To construct the network given the set of selected solu- 
tions, various methods can be used. All methods have 
two basic components: a scoring metric which discrim- 
inates the networks according to their quality and the 
search algorithm which searches over the networks to 
find the one with the best scoring metric value. The 
BOA can use any scoring metric and search algorithm. 
In our recent experiments, we have used the
Bayesian-Dirichlet metric (Heckerman, Geiger, &
Chickering, 1994). The complexity of the considered
models was bounded by the maximum number of incoming
edges into any node, denoted by $k$. To search the space
of networks, a simple greedy algorithm was used due to
its efficiency. For further details, see Pelikan,
Goldberg, and Cantú-Paz (1999).

5 HIERARCHICAL MODEL 
BUILDING 

To hierarchically solve a problem, we need to
incrementally find important low-order partial solutions
and combine these to create the solutions of higher
order. Starting with single bits (symbols of the base
alphabet), once we get the top high-quality solutions of
some order, we simply treat these solutions as the
building blocks to be used to construct solutions of
higher order. In this fashion, the order of the partial
solutions we get gradually grows over time.

5.1 HIERARCHICAL MODELS 

In order to adjust modeling to hierarchical problems,
we will use models that, in addition to estimating the
joint distribution between single variables, also allow
multiple variables to be merged together to form a new
variable. This variable will be further treated as a
single unit. In this fashion, the solutions of higher
order can be formed by using groups (clusters) of
variables as basic building blocks.

The idea of clustering the input variables and treating
each cluster as an intact building block comes from
the learning used in the extended compact genetic
algorithm (ECGA) (Harik, 1999). For each group of
variables, only the instances that are in the modeled
data set will be considered, as in learning Bayesian
networks with local structure (Friedman & Goldszmidt,
1999). The clusters (groups) of variables are related as
in classical directed-acyclic-graph (DAG) Bayesian
networks used in the original BOA. This class of hybrid
models was first introduced by Davies and Moore
(1999), who called these models Huffman networks.

Figure 2: Models used in a) ECGA, b) BOA, c) hierarchical BOA (Huffman networks), and d) an alternative
model based on using hidden variables.

Let us, for example, at a certain point in time, have
three positions that take only two combinations of
values in the entire population: 000 and 111. Then,
instead of working with each of these positions
separately, they can be merged into a single binary
variable with two new values 0' and 1', where 0'
corresponds to 000 and 1' corresponds to 111. In this
fashion, both the model complexity and the model
expressiveness improve. Moreover, by reducing the number
of variables, the search for good networks becomes more
efficient and accurate. Each group of merged variables
represents an intact part of the solutions from a lower
level that is to be treated as a single variable on a
higher level.
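The merging step in this example can be illustrated with a short sketch. The function name `merge_positions` and the toy population are hypothetical; the only point is that a group of positions is replaced by one variable whose values index the instances that actually occur in the population.

```python
from typing import List, Sequence, Tuple


def merge_positions(population: Sequence[Sequence[int]],
                    positions: Sequence[int]
                    ) -> Tuple[List[Tuple[int, ...]], List[int]]:
    """Merge `positions` into a single variable.

    Returns the sorted list of instances observed at these positions and,
    for each string, the index of its instance (the value of the new
    variable).
    """
    observed = sorted({tuple(s[p] for p in positions) for s in population})
    code = {inst: k for k, inst in enumerate(observed)}
    merged = [code[tuple(s[p] for p in positions)] for s in population]
    return observed, merged


population = [[0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
instances, new_variable = merge_positions(population, [0, 1, 2])
# instances lists only (0, 0, 0) and (1, 1, 1), so the merged variable is
# binary: value 0 plays the role of 0' and value 1 the role of 1'.
```

Only two of the eight possible instances of the three positions occur, so the merged variable needs two values rather than eight.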

An example model with a few groups of variables is
shown in Figure 2c. For comparison, similar examples
of models in the ECGA and BOA are shown in parts
a) and b) of the same figure. The use of Huffman
networks does not require sacrificing modeling
generality as in the ECGA. All relationships expressed
by DAG models can be covered. On the other hand, the
overly complex DAG models used in the original BOA can
be significantly simplified by "crossing over" the two
approaches.

A similar reduction of the total model complexity can be
achieved by using hidden variables, which are often used
in Bayesian networks. In fact, using hidden variables
is an alternative and more general approach to the
problem of hierarchical model building. We believe
that using these models would further improve model
building for problems of a very complex structure. A
model similar to the one shown in Figure 2c, based on
using hidden variables, is shown in Figure 2d.

5.2 SCORING METRIC FOR HUFFMAN 
NETWORKS 

To learn a model of solutions on a certain level, we will
use a combination of the learning methods used in the
original Bayesian optimization algorithm (BOA)
(Pelikan, Goldberg, & Cantú-Paz, 1999), the extended
compact genetic algorithm (ECGA) (Harik, 1999), and
Bayesian networks with local structure (Friedman &
Goldszmidt, 1999). To discriminate the networks, a
minimum description length (MDL) metric will be
used. The BDe metric with an additional term preferring
simpler networks (Friedman & Goldszmidt, 1999) can
be used, too. In either case, simpler models must be
preferred to more complex ones, since the clusters tend
to grow indefinitely and a bound on the complexity
of models cannot be directly applied without weakening
the modeling capabilities on hierarchical problems.

To store data according to a particular model, we need 
to store (1) the definition of clusters (groups) of vari- 
ables in the model, (2) the probabilistic relationships 
between the groups of variables (edges between the 
groups in the model), and (3) the data set (the set of 
selected solutions) compressed according to the model. 
Each variable (bit position) is in exactly one of the 
clusters. The description of data will contain the fol- 
lowing fields: (1) the number of clusters in the model, 
(2) an array of cluster definitions, and (3) the popula- 
tion compressed according to the model. 

In the further text we will use the following notation:
$n$ denotes the number of variables; $N$ denotes the
number of instances in the modeled data set; $m$
denotes the number of clusters (groups of variables);
$G = (G_0, \ldots, G_{m-1})$ denotes the set of clusters
$G_i$; $|G_i|$ denotes the number of variables in $G_i$;
$\|G_i\|$ denotes the number of instances of the
variables in $G_i$; $\Pi_i$ denotes the set of parent
groups of $G_i$; $|\Pi_i|$ denotes the number of parent
groups in $\Pi_i$; and $\|\Pi_i\|$ denotes the number of
instances of the set of groups $\Pi_i$.

There can be at most $n$ groups of variables, i.e.
$m \le n$, and therefore at most $\log_2 n$ bits are
needed in order to store the number $m$ of groups.

The definition of each group contains (1) the size of
the group, (2) the indices of the variables contained in
the group, (3) the set of instances of this group, (4) the
set of this group's parent identifiers, and (5) the set of
conditional probabilities of the instances in this group
given all the instances of its parent groups. There can
be at most $n$ variables in each group, and therefore
the size of each group can be stored by using $\log_2 n$
bits. This bound could be further reduced by analyzing
the entire description at once. There are
$\binom{n}{|G_i|}$ possibilities to choose variables to
form $G_i$. Thus, to identify the set of variables in
$G_i$, we need to store only the order of this subset in
some ordering of all possible subsets of this size, i.e.
we need at most $\log_2 \binom{n}{|G_i|}$ bits. Assuming
that we use binary variables, the set of instances of
$G_i$ can be stored by using $\log_2 2^{|G_i|} = |G_i|$
bits for the number of instances and
$|G_i| \cdot \|G_i\|$ bits for the specification of all
bits in these instances.

Each group can have at most $n - 1$ parents in the
network. Thus, the number of parents can be stored
by using $\log_2(n - 1)$ bits. The number of bits needed
to store the components of $\Pi_i$ is
$\log_2 \binom{m}{|\Pi_i|}$.

To store conditional probabilities for $G_i$, we will
store a frequency of each combination of instances of
the variables in $G_i$ and its parents. There are at most

$$\|G_i\| \cdot \|\Pi_i\|$$

possible instances. However, this number might be
further reduced by using local structures (Friedman &
Goldszmidt, 1999) or considering only instances that
really appear in the modeled data set. Each frequency
can be stored in $0.5 \log_2 N$ bits with a sufficient
degree of accuracy (Friedman & Yakhini, 1996). Thus, to
store the conditionals corresponding to $G_i$, we need at
most

$$\frac{\log_2 N}{2} \left( \|\Pi_i\| \cdot \|G_i\| - 1 \right)$$

bits, since the last frequency can be computed from
the remaining ones.

To store the data compressed according to the above
model, we need at most

$$-N \sum_{i=0}^{|G|-1} \sum_{g_i, \pi_i} p(g_i, \pi_i) \log p(g_i \mid \pi_i)$$

bits (Friedman & Goldszmidt, 1999), where the inner
sum runs over all instances $g_i$ and $\pi_i$ of the
variables in $G_i$ and $\Pi_i$ respectively,
$p(g_i, \pi_i)$ is the probability of the instance with
the variables in $G_i$ and $\Pi_i$ set to $g_i$ and
$\pi_i$ respectively, and $p(g_i \mid \pi_i)$ is the
conditional probability of the variables in $G_i$ set to
$g_i$ given that the variables in $\Pi_i$ are set to
$\pi_i$.
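The compressed-data term above can be computed directly from counts, since $N \, p(g_i, \pi_i)$ is simply the number of strings in which the instance occurs. The sketch below assumes base-2 logarithms (so that the result is in bits) and uses probabilities estimated as relative frequencies in the data set; the function name `compressed_data_bits` and the representation of clusters and parent groups as lists of indices are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def compressed_data_bits(data: Sequence[Sequence[int]],
                         clusters: List[List[int]],
                         parents: List[List[int]]) -> float:
    """Compute -N * sum_i sum_{g, pi} p(g, pi) * log2 p(g | pi).

    clusters[i] lists the variable indices in group G_i and parents[i] lists
    the indices of the parent groups of G_i.
    """
    total = 0.0
    for i, cluster in enumerate(clusters):
        parent_vars = [v for g in parents[i] for v in clusters[g]]
        joint = Counter()     # counts of (g_i, pi_i)
        marginal = Counter()  # counts of pi_i
        for row in data:
            g = tuple(row[v] for v in cluster)
            pi = tuple(row[v] for v in parent_vars)
            joint[(g, pi)] += 1
            marginal[pi] += 1
        for (g, pi), count in joint.items():
            # count equals N * p(g, pi); count / marginal[pi] equals p(g | pi)
            total -= count * math.log2(count / marginal[pi])
    return total


data = [[0, 0], [1, 1], [0, 0], [1, 1]]
independent = compressed_data_bits(data, [[0], [1]], [[], []])
with_edge = compressed_data_bits(data, [[0], [1]], [[], [0]])
merged = compressed_data_bits(data, [[0, 1]], [[]])
```

On this toy data set, treating the two perfectly correlated bits as independent costs 8 bits, whereas either adding an edge between them or merging them into one cluster reduces the term to 4 bits.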

The overall description length is then computed as the 
sum of all terms computed above. The lower the met- 
ric, the better the model. 



5.3 BUILDING A HUFFMAN NETWORK 

A method for building Huffman networks for
compression of large data sets was presented in Davies
and Moore (1999). This method proceeds similarly
to other search methods commonly used for learning
Bayesian networks, by incrementally performing
elementary graph operations on the model to improve
the value of the scoring metric. This greedy algorithm
is often used for its efficiency. A general scheme of
the greedy search method used in the original BOA as
well as in Davies and Moore (1999) follows:

(1) Initialize the network (to an empty, random, or 
the best network from the last generation). 

(2) Pick an elementary graph operation that improves 
the score of the current network the most. 

(3) If there is such operation, perform it, and go to 
step 2. 

(4) If no operation improves the score, finish. 
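The four steps above can be sketched as a generic greedy loop. To keep the example short and runnable, the sketch below is deliberately simplified: the model is only a partition of the variables into clusters (no edges between clusters), the only elementary operation is joining two clusters, and the score is a toy description length (compressed data plus $0.5 \log_2 N$ bits per stored frequency). The names `greedy_search`, `join_neighbors`, and `toy_score` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def toy_score(partition, data):
    """Toy description length in bits (lower is better)."""
    n_rows = len(data)
    bits = 0.0
    for cluster in partition:
        idx = sorted(cluster)
        counts = Counter(tuple(row[v] for v in idx) for row in data)
        bits += 0.5 * math.log2(n_rows) * (len(counts) - 1)  # frequencies
        bits -= sum(c * math.log2(c / n_rows) for c in counts.values())
    return bits


def join_neighbors(partition):
    """All models reachable by joining two clusters into one."""
    for a, b in combinations(partition, 2):
        yield (partition - {a, b}) | {a | b}


def greedy_search(initial, neighbors, score):
    current, current_score = initial, score(initial)      # step (1)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current):               # step (2)
            s = score(candidate)
            if s < best_score:
                best, best_score = candidate, s
        if best is None:                                   # step (4)
            return current, current_score
        current, current_score = best, best_score          # step (3)


# Bits 0, 1, 2 always agree; bit 3 is independent of them.
data = [[a, a, a, b] for a in (0, 1) for b in (0, 1)] * 2
start = frozenset(frozenset({v}) for v in range(4))
model, bits = greedy_search(start, join_neighbors,
                            lambda p: toy_score(p, data))
```

On this data the search joins the three correlated positions into one cluster and leaves the independent position on its own, because joining it as well would increase the description length.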

In addition to the commonly used operations such as edge
addition, edge removal, and edge reversal, we can use a
new operation that can either (1) join two of the groups
of variables to form a single cluster or (2) move one
variable from one cluster to another (deleting
clusters that have become empty, if any). In Davies and
Moore (1999), the second operation was used. In both
cases, the conflicts arising from the existence of
cycles must be resolved. When joining two groups, the
edges can be rearranged either conservatively, so that
only the edges that coincided with both of the groups
are kept, or so that all edges to and from either of the
groups are kept, if possible.

6 FUTURE WORK 

We are currently implementing the extended hierarchi- 
cal BOA (hBOA) in order to test the algorithm on var- 
ious hierarchically decomposable problems. In order 
to analyze the performance of our algorithm and com- 
pare it to alternative methods, a set of test problems 
will be designed. The results of our recent research at 
the Illinois Genetic Algorithms Laboratory, focusing 
on various aspects of problem difficulty, will be used 
in order to design a rigorous test-suite for methods 
that approach the problem in a hierarchical fashion. 

Alternative solutions to the problem of hierarchical 
modeling will be compared on the designed test-suite. 
One of the alternatives, based on using hidden vari- 
ables, was outlined above. There are other methods 
that may be used and the question of suitability of 
each approach still remains open. 



The paper did not discuss niching, which becomes a 
very important issue when solving hierarchical prob- 
lems. Although it may not be necessary to use niching 
for solving additively decomposable problems, when 
solving hierarchical problems it becomes a necessity 
since many alternative low-order solutions should be 
preserved before we can find the best way of juxta- 
posing these. In other words, the importance of the 
notion of the minimal sequential non-inferior BB of a
solution dominates that of the minimal sequential
superior BB of a solution (Goldberg, 1997). Therefore,
one of the most important issues of future research
on this topic should be the use of niching in the
methods based on probabilistic modeling, such as the
BOA. More advanced niching techniques can be
designed by using the constructed model as a hint on
the structure of the problem at hand.

7 SUMMARY AND CONCLUSIONS 

Recently, Watson et al. (1998) suggested that the simple
genetic algorithm can solve some hierarchically
decomposable problems quite efficiently. On the other
hand, anomalous behavior of the simple GA on problems
with rewards for various combinations of building
blocks was observed (Forrest & Mitchell, 1993). We
believe that the algorithms that use Huffman networks
to model promising solutions will offer an efficient and
very robust method to solve the class of hierarchical
problems. Even though the hierarchical BOA is aimed
at solving hierarchically decomposable problems, we
expect that its overall performance on other problems
will also improve. In fact, by simplifying the used class
of models without sacrificing their generality, the
modeling capabilities of the BOA should improve, which
should in turn allow the BOA to solve a more general
class of problems efficiently and reliably.

The paper discussed three major issues. It provided 
the reasons for approaching problems in a hierarchi- 
cal fashion. The class of hierarchically decompos- 
able problems which extends additively decomposable 
problems in order to test for hierarchical capabilities of 
optimization algorithms was defined. Possible exten- 
sions of the original Bayesian optimization algorithm 
were outlined, and the direction of future research in 
the discussed area was drawn. 